Abstract

Background: Batch effects are a type of variability that is not of primary interest but is ubiquitous in sizable genomic
experiments. To minimize the impact of batch effects, an ideal experimental design should ensure the even
distribution of biological groups and confounding factors across batches. However, due to practical
complications, the final collection of samples available in a genomics study might be unbalanced and
incomplete, which, without appropriate attention to sample-to-batch allocation, could lead to drastic batch effects.
Therefore, an effective and handy tool is needed to assign collected samples across batches in an
appropriate way in order to minimize the impact of batch effects.

Results: We describe OSAT (Optimal Sample Assignment Tool), a Bioconductor package designed for automated
sample-to-batch allocation in genomics experiments.

Conclusions: OSAT is developed to facilitate the allocation of collected samples to different batches in genomics
studies. By optimizing the even distribution of samples in groups of biological interest across batches, it
can reduce the confounding or correlation between batches and the biological variables of interest. It can also
optimize the homogeneous distribution of confounding factors across batches. It can handle challenging instances
where incomplete and unbalanced sample collections are involved, as well as ideally balanced designs.



Background 

A sizable genomics study, such as a microarray study, often
involves the use of multiple batches (groups) of experiments
due to practical complications. The systematic, non-biological
differences between batches in a genomics experiment
are referred to as batch effects. Batch effects are widespread
in genomic studies, and it has been
shown that noticeable variation between different batch
runs can be a real concern, sometimes even larger than
the biological differences [1-5]. Without sound experimental
designs and statistical analysis methods to handle batch
effects, misleading or even erroneous conclusions could
be drawn. This especially important issue is unfortunately
often overlooked, partially due to the complexity and multiple
steps involved in genomics studies.

To minimize the impact of batch effects, a careful
experimental design should ensure the even distribution
of biological groups and confounding factors across
batches. It would be problematic if one batch run contained
most samples of a particular biological group. In
an ideal genomics design, the groups of main interest,
as well as important confounding variables, should
be balanced and replicated across the batches to form a
Randomized Complete Block Design (RCBD) [6-8]. Such a
design makes the separation of the real biological effect of
interest from the effects of other confounding factors
statistically more powerful.

However, despite best efforts, more often than not
the collected samples do not comply with the
original ideal RCBD design. This is because
these studies are mostly observational or quasi-experimental,
since we usually do not have full control
over sample availability [1]. In a clinical genomics study,
samples may be rare, difficult or expensive to collect,
irreplaceable, or fail QC before profiling. The resulting
unbalanced and incomplete nature of sample availability
in a genomics study, without appropriate attention
to sample-to-batch allocation, could lead to drastic batch
effects. Therefore, an effective and handy tool is needed
to assign collected samples across batches in
an appropriate way in order to minimize the impact of
batch effects.

We developed OSAT to facilitate the allocation of collected
samples into different batches in genomics studies.
OSAT is not intended to be software for
experimental design carried out before sample collection;
rather, it is developed to fulfill the needs arising from
some practical limitations occurring in genomics
experiments. Specifically, OSAT is developed to address
one practical issue in genomics studies: once the available
experimental samples ready to be profiled on the
genomics instruments have been collected, how should one
allocate these samples to different batches in a proper way
to achieve an optimal setup that minimizes the impact of
batch effects at the genomic profiling stage? With a
block randomization step followed by an optimization
step, it produces a setup that optimizes the even distribution
of samples in groups of biological interest into
different batches, reducing the confounding or correlation
between batches and the biological variables of
interest. It can also optimize the even distribution of
confounding factors across batches. OSAT can handle
challenging instances where incomplete and unbalanced
sample collections are involved, as well as ideal
balanced RCBD.

Results 

Datasets 

An example dataset is used for demonstration. It represents
samples from a study where the primary interest is
to investigate differences in expression between case and
control groups (variable SampleType). Two additional
variables, Race and AgeGrp, are clinically
important variables that may have an impact on the final
outcome. We consider them as confounding variables. A
total of 576 samples are included in the study, with one
sample per row in the example file. As shown in Additional
file 1: Table S1-S2, none of the three variables
has a balanced distribution.

Comparison of different sample assignment algorithms 

The default algorithm implemented in OSAT will first
block the three variables considered (i.e., SampleType, Race
and AgeGrp) to generate a single initial assignment
setup, and then identify the optimal one with the most
homogeneous cross-batch strata distribution through
shuffling the initial setup. Alternatively, if blocking the
primary variable (i.e., SampleType) is the most important
and the optimization of the other two variables is less
important (but desired), a different algorithm implemented
in OSAT can be used. It works by first blocking
SampleType only to generate a pool of assignment setups,
and then selecting the optimal one with the most homogeneous
cross-batch strata (i.e., SampleType, Race and AgeGrp)
distribution.

As shown in Figure 1a-c, the final setup produced by
the default algorithm is characterized by a relatively uniform
distribution of all three variables across the
batches. Pearson's χ² tests examining the association between
batches and each of the variables considered indicate
that all three variables are highly
uncorrelated with batches (p-value > 0.99, Table 1). On
the other hand, as shown in Figure 2a-c, the final setup
produced by the alternative algorithm is characterized
by an almost perfectly uniform distribution of the SampleType
variable (with small variation only due to the inherent
limitation of the starting data, such as unbalanced sample
collection), while the uniformity of the other two variables,
which are not included in the block randomization step,
is decreased. Pearson's χ² test (Table 1) shows that the
resulting chi-square for SampleType decreases while
those for Race and AgeGrp increase, indicating the tradeoff
in prioritizing the variable of primary interest for block
randomization. Nevertheless, as shown in Figure 1d and
Figure 2d, both algorithms produce final setups which
show a more homogeneous cross-batch strata distribution
than the corresponding starting ones.

Simply performing complete randomization might
lead to an undesired sample-to-batch assignment, especially
for unbalanced and/or incomplete sample sets. In fact,
there is a substantial chance that variables will be statistically
dependent on batches if a complete randomization
is carried out, especially for incomplete and/or unbalanced
sample collections. As shown in Figure 3, an undesired
setup can be produced through complete
randomization of sample-to-batch assignment. Pearson's
χ² tests indicate that all three variables are statistically
dependent on batches with p-values < 0.05 (Table 1).
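The batch-dependence check used above can be illustrated with a short sketch. It is written in Python with the standard library only, purely for illustration; it is not OSAT code, the function name chi_square_statistic and the toy data are hypothetical, and it computes only the Pearson chi-square statistic and its degrees of freedom (a p-value would additionally require a chi-square distribution function).

```python
from collections import Counter


def chi_square_statistic(batches, levels):
    """Pearson chi-square statistic and degrees of freedom for the
    batch-by-level contingency table of one sample variable."""
    n = len(batches)
    observed = Counter(zip(batches, levels))
    batch_totals = Counter(batches)
    level_totals = Counter(levels)
    stat = 0.0
    for batch, batch_total in batch_totals.items():
        for level, level_total in level_totals.items():
            # Expected count under independence of batch and variable.
            expected = batch_total * level_total / n
            stat += (observed[(batch, level)] - expected) ** 2 / expected
    df = (len(batch_totals) - 1) * (len(level_totals) - 1)
    return stat, df


# Hypothetical toy data: 8 samples on two plates.
plates = ["P1"] * 4 + ["P2"] * 4
balanced = ["Case", "Case", "Control", "Control"] * 2   # evenly spread
confounded = ["Case"] * 4 + ["Control"] * 4             # one group per plate

print(chi_square_statistic(plates, balanced))    # statistic 0.0, df 1
print(chi_square_statistic(plates, confounded))  # statistic 8.0, df 1
```

A statistic near zero corresponds to the nearly uniform setups in Table 1, while a large statistic corresponds to the undesired, batch-dependent setup.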

Conclusions 

Genomics experiments are often driven by the availability
of the final collection of samples, which might be
unbalanced and incomplete. The unbalanced and incomplete
nature of sample availability thus calls for the
development of effective tools to assign collected samples
across batches in an appropriate way in order to
minimize the impact of batch effects at the genomics
experiment stage. OSAT is developed to facilitate the
allocation of collected samples to different batches in
genomics studies. With a block randomization step
followed by an optimization step, it produces a setup that
optimizes the even distribution of samples in groups of
biological interest into different batches, reducing the
confounding or correlation between batches and the
biological variables of interest. It can also optimize
the homogeneous distribution of confounding factors
across batches. While motivated to handle challenging
instances where incomplete and unbalanced sample
collections are involved, OSAT can also handle ideal
balanced RCBD.

Figure 1 Summary of final setup produced by the default algorithm. a) the distribution of SampleType characteristic across the plates; b) the distribution
of Race characteristic across the plates; c) the distribution of AgeGrp characteristic across the plates; d) the index of optimization steps versus value of
the objective function. The blue diamond indicates the starting point, and the red diamond marks the final optimized setup.

Partly due to its simplicity in implementation, complete
randomization has been frequently used in the sample
assignment step of experimental practice. When the sample size
is large enough, a randomized design will be close to a
balanced design. However, simple randomization could
lead to an undesirable imbalanced design where efficiency
and confounding might be an issue after the data collection.
As we demonstrated in the manuscript, simply performing
randomization might lead to an undesired
sample-to-batch setup showing batch dependence, especially for
unbalanced and/or incomplete sample sets that do not
comply with the original ideal design. The OSAT package is
designed to avoid such a scenario by providing a simple
pipeline to create a sample assignment that minimizes the
association between sample characteristics and batches.
The software was implemented in a flexible way so that it
can be adopted by genomics practitioners who might not
be specialized in experimental design.

It should be emphasized that although the impact of
batch effects on a genomics study might be minimized
through proper design and sample allocation, it may not
be completely eliminated. Even with a perfect design and
best effort in all stages of the experiment, including
sample-to-batch assignment, it is impossible to define or control
all potential batch effects. Many statistical methods have
been developed to estimate and reduce the impact of
batch effects at the data analysis stage (i.e., after the
experiment part is done) [1,9-12]. It would be helpful if
analytic methods handling batch effects were employed in
all stages of a genomics study, from experimental design
to data analysis.

Experimental design has been applied in many areas,
with methods being tailored to the needs of various
fields. A collection of R packages for experimental
design is available at http://cran.r-project.org/web/views/ExperimentalDesign.html.
Many of these existing experimental design packages
work for the ideal situation (i.e., before
sample collection) where the sample size is fixed and/or
the model is specified. For example, the software at the above
link includes optimal design (e.g., AlgDesign, requiring
model specification), orthogonal arrays for main effects
experiments (e.g., function oa.design, constrained by
sample size/number of factors), factorial 2-level designs
(e.g., package FrF2, particularly important in industrial
experimentation), etc. We developed OSAT to facilitate
the allocation of collected samples into different
batches in genomics studies. Our software implements
general experimental design methodology to achieve
the optimal sample-to-batch assignment in order to
minimize the impact of batch effects. It is specifically
used in the profiling stage of a genomics study, when the
available experimental samples ready to be profiled on
the genomics instruments have been collected. It provides
predefined batch layouts for some of the most commonly
used genomics platforms. Written in a modularized style
in the open source R environment, it provides the flexibility
for users to define the batch layout of their own
experiment platform, as well as an optimization objective
function for their specific needs in sample-to-batch
assignment. To the best of our knowledge, there is no other tool for
this important utility within the framework of
Bioconductor.

Table 1 Comparison of sample assignment by two algorithms implemented in OSAT and an undesired sample
assignment through complete randomization

                    Default algorithm          Alternative algorithm      An undesired setup through
                    (optimal.shuffle)          (optimal.block)            complete randomization
Variable      DF    Chi-square   P value       Chi-square   P value       Chi-square   P value
SampleType     5    0.2034518    0.9990763     0.03507789   0.9999879     13.25243     0.021124664
Race           5    0.2380335    0.9986490     3.68541503   0.5955359     14.22455     0.014244218
Age_grp       20    0.8138166    1.0000000     5.08147313   0.9996856     39.75020     0.005371387

Figure 2 Summary of final setup produced by the alternative algorithm. a) the distribution of SampleType characteristic across the plates; b)
the distribution of Race characteristic across the plates; c) the distribution of AgeGrp characteristic across the plates; d) the index of generated setups
versus value of the objective function. The blue diamond indicates the first setup generated, and the red diamond marks the final selected setup.

Figure 3 Summary of an undesired setup produced by complete randomization. a) the distribution of SampleType characteristic across the
plates; b) the distribution of Race characteristic across the plates; c) the distribution of AgeGrp characteristic across the plates.

Methods 

Methodology 

The current version of OSAT provides two algorithms
for creating sample assignments across the batches
based on the principle of block randomization, which is
an effective approach to controlling variability from
nuisance variables such as batches and their interaction
with variables of primary interest [6-8,13]. Both
algorithms are composed of a block randomization step
and an optimization step. The default algorithm (implemented
in function optimal.shuffle) first blocks
all variables considered to generate a single initial
assignment setup, then identifies the optimal one that
minimizes the objective function (i.e., the one with the
most homogeneous cross-batch strata distribution)
by shuffling the initial setup. The alternative algorithm
(implemented in function optimal.block)
first blocks specified variables (e.g., a list of variables of
primary interest) to generate a pool of assignment setups,
then selects the optimal one that minimizes the objective
function based on all variables considered (including
those variables that are not included in the block
randomization step). A detailed description is provided
below.

By combining the variables of interest, we can create a
unified variable whose levels are all possible combinations
of the levels of the variables involved. Assume
there are a total of $s$ levels in the unified variable
(referred to as optimization strata in this package) with $S_j$
samples in each stratum, $j = 1, \ldots, s$, and that we
have $m$ batches with $B_i$, $i = 1, \ldots, m$, wells available in each
batch. In an ideal balanced RCBD experiment, we have
equal sample size in each stratum, $S_1 = \cdots = S_s = S$, and
each batch includes the same number of available wells,
$B_1 = \cdots = B_m = B$, with an equal number of samples from
each sample stratum.

The expected number of samples from each stratum to
each batch is denoted as $E_{ij}$. One can split it into its integer
part and fractional part as

$$E_{ij} = \frac{B_i S_j}{\sum_i B_i} = \lfloor E_{ij} \rfloor + \delta_{ij}$$

where $\lfloor E_{ij} \rfloor$ is the integer part of the expected number
and $\delta_{ij}$ is the fractional part. In the case of equal batch size,
it reduces to $E_{ij} = S_j / m$. When we have RCBD, all $\delta_{ij}$ are
zero.
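As a small worked illustration of this split, assume the expected count is the stratum size scaled by the batch's share of all wells, and take hypothetical numbers: $m = 6$ plates of $B_i = 96$ wells each (576 wells in total) and a stratum with $S_j = 50$ samples. These values are chosen only for illustration and are not taken from the example dataset.

```latex
\begin{align*}
E_{ij} &= \frac{B_i S_j}{\sum_i B_i} = \frac{96 \times 50}{576} = 8\tfrac{1}{3},
\qquad \lfloor E_{ij} \rfloor = 8, \qquad \delta_{ij} = \tfrac{1}{3}, \\
S_j - \sum_{i=1}^{6} \lfloor E_{ij} \rfloor &= 50 - 6 \times 8 = 2
= \sum_{i=1}^{6} \delta_{ij}.
\end{align*}
```

Each plate therefore receives 8 samples of this stratum for certain, and the 2 remaining samples cannot be split evenly over 6 plates, which is why a balanced RCBD is not attainable for such a stratum.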

For an actual sample assignment, let $n_{ij}$ denote the
number of samples from optimization stratum $j$ assigned to batch $i$.
Our goal is, through a block randomization step and
an optimization step, to minimize the difference between
the expected sample size $E_{ij}$ and the actual sample
size $n_{ij}$.

The block randomization step creates the initial
setup(s) of randomized sample assignment based on
strata combining the blocking variables considered.
The blocking variables include all variables of interest
in the default algorithm, but only a specified subset of
variables in the alternative algorithm.

In this step, we sample $i$ sets of samples from each
stratum $S_j$ with size $\lfloor E_{ij} \rfloor$, as well as $j$ sets of wells
from each batch $B_i$ with size $\lfloor E_{ij} \rfloor$. The two
selections are linked together by the $ij$ subgroup,
randomized in each of them. The rest of the samples,
$r_j = S_j - \sum_i \lfloor E_{ij} \rfloor$, can be assigned to the available wells in
each block, $w_i = B_i - \sum_j \lfloor E_{ij} \rfloor$. The probability of a
sample in $r_j$ from stratum $S_j$ being assigned to a well
from block $B_i$ is proportional to the fractional part of
the expected sample size, $\delta_{ij}$. For an RCBD, each batch
will have an equal number of samples with the same
characteristics and there is no need for further optimization.
However, for other instances where the collection of samples
is unbalanced and/or incomplete, an optimization
step is needed to create a more optimal setup of sample
assignment.
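The block randomization step can be sketched as follows. This is a simplified illustration in Python (standard library only), not the OSAT implementation: the function name block_randomize is hypothetical, the expected count is assumed to be the stratum size scaled by the batch's share of all wells, only the per-batch counts are tracked rather than individual samples and wells, and the fallback to equal weights when every fractional part is zero is an added assumption.

```python
import random
from fractions import Fraction
from math import floor


def block_randomize(strata_sizes, batch_sizes, seed=None):
    """Return counts[i][j]: number of samples of stratum j placed in batch i."""
    rng = random.Random(seed)
    total_wells = sum(batch_sizes)
    if sum(strata_sizes) > total_wells:
        raise ValueError("more samples than available wells")
    m, s = len(batch_sizes), len(strata_sizes)
    # Expected counts, kept as exact fractions (assumed form of E_ij).
    expected = [[Fraction(b * sj, total_wells) for sj in strata_sizes]
                for b in batch_sizes]
    # Integer parts are assigned deterministically.
    counts = [[floor(e) for e in row] for row in expected]
    free_wells = [batch_sizes[i] - sum(counts[i]) for i in range(m)]
    for j in range(s):
        leftover = strata_sizes[j] - sum(counts[i][j] for i in range(m))
        for _ in range(leftover):
            open_batches = [i for i in range(m) if free_wells[i] > 0]
            # Weight each open batch by the fractional part of E_ij.
            weights = [float(expected[i][j] - floor(expected[i][j]))
                       for i in open_batches]
            if sum(weights) == 0:
                weights = [1.0] * len(open_batches)  # fallback (assumption)
            i = rng.choices(open_batches, weights=weights)[0]
            counts[i][j] += 1
            free_wells[i] -= 1
    return counts


# Balanced case: every expected count is an integer, so nothing is random.
print(block_randomize([6, 6], [4, 4, 4]))  # [[2, 2], [2, 2], [2, 2]]
```

For an unbalanced collection, for example block_randomize([50, 30, 16], [32, 32, 32], seed=1), the integer parts are fixed and only the few leftover samples are placed at random, so the result depends on the seed.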

The optimization step aims to identify an optimal
setup of sample assignments from multiple candidates.
To select the optimal sample assignment, we need
to measure the variation of sample characteristics between
batches. In this package, we define the optimal
design as a sample assignment setup that minimizes
our objective function based on the principle of the least
squares method [13]. The objective function can be
defined as

$$V = \sum_{ij} (n_{ij} - E_{ij})^2$$

where $E_{ij}$ and $n_{ij}$ were defined previously.
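A least-squares objective of this kind can be written in a few lines. The sketch below is in Python for illustration only and is not taken from the OSAT source; it assumes the objective is the sum over batches and strata of the squared difference between the actual and expected counts, with the expected count taken as the stratum size scaled by the batch's share of all wells. The function name objective is hypothetical.

```python
def objective(counts, batch_sizes):
    """Sum of squared differences between actual and expected counts.

    counts[i][j] is the number of samples of stratum j in batch i;
    batch_sizes[i] is the number of wells in batch i.
    """
    total_wells = sum(batch_sizes)
    # Stratum sizes are the column sums of the count table.
    strata_sizes = [sum(col) for col in zip(*counts)]
    return sum(
        (n - b * sj / total_wells) ** 2
        for row, b in zip(counts, batch_sizes)
        for n, sj in zip(row, strata_sizes)
    )


# Two batches of 4 wells, two strata of 4 samples each.
print(objective([[2, 2], [2, 2]], [4, 4]))  # 0.0  (perfectly balanced)
print(objective([[4, 0], [0, 4]], [4, 4]))  # 16.0 (stratum confounded with batch)
```

The value is zero exactly when every batch holds its expected share of every stratum, and it grows as strata concentrate in particular batches.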

In the default algorithm implemented in OSAT,
optimization is conducted through shuffling the initial
setup obtained in the block randomization step. Specifically,
after the initial setup is created, we randomly select k
samples from different batches and shuffle them between
batches to create a new sample assignment. The value
of the objective function is calculated for the new setup
and compared to that of the original one. If the new
value is smaller, the new assignment will replace the
previous one. This procedure will continue until we reach a
pre-set number of attempts (5000 by default).
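The shuffling procedure can be sketched as a simple accept-if-better loop. The Python code below is an illustration under stated assumptions, not the OSAT implementation: it fixes k = 2 (one sample is swapped between two randomly chosen batches per attempt), represents each sample only by its stratum label, and computes expected counts from the number of samples in each batch. The names objective and optimal_shuffle_sketch are hypothetical.

```python
import random
from collections import Counter


def objective(batches):
    """Sum of squared (actual - expected) stratum counts over all batches."""
    total = sum(len(batch) for batch in batches)
    strata = Counter(label for batch in batches for label in batch)
    value = 0.0
    for batch in batches:
        observed = Counter(batch)
        for label, size in strata.items():
            expected = len(batch) * size / total
            value += (observed[label] - expected) ** 2
    return value


def optimal_shuffle_sketch(batches, n_attempts=5000, seed=None):
    """Swap one sample between two random batches; keep the swap only if
    it lowers the objective. Requires at least two non-empty batches."""
    rng = random.Random(seed)
    current = [list(batch) for batch in batches]
    best = objective(current)
    for _ in range(n_attempts):
        a, b = rng.sample(range(len(current)), 2)
        i = rng.randrange(len(current[a]))
        j = rng.randrange(len(current[b]))
        current[a][i], current[b][j] = current[b][j], current[a][i]
        value = objective(current)
        if value < best:
            best = value
        else:
            # Not an improvement: undo the swap.
            current[a][i], current[b][j] = current[b][j], current[a][i]
    return current, best


start = [["case"] * 4, ["control"] * 4]  # fully confounded starting setup
final, value = optimal_shuffle_sketch(start, n_attempts=500, seed=0)
print(objective(start), value)
```

Because a swap never changes how many samples each batch holds, the batch layout is preserved while the strata become more evenly spread.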

In the alternative algorithm, multiple (typically thousands
or more) sample assignment setups are first
generated by the procedure described in the block
randomization step above, based only on the list of
specified blocking variable(s). The optimal one is then
chosen by selecting the setup (from the pool generated
in the block randomization step) that minimizes the
value of the objective function based on all variables
considered. This algorithm guarantees the identification
of a setup that conforms to the blocking requirement
for the list of specified blocking variables,
while attempting to minimize the between-batch variations
of the other variables considered.
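The generate-a-pool-then-select idea can be sketched as below. This Python code is only an illustration and not the OSAT implementation: each sample is a tuple whose first element is the blocking variable, the block randomization step is simplified to shuffling within each blocking level and dealing samples round-robin into batches, and the names strata_objective, blocked_setup and optimal_block_sketch are hypothetical.

```python
import random
from collections import Counter


def strata_objective(batches):
    """Least-squares imbalance over strata formed by all variables."""
    total = sum(len(batch) for batch in batches)
    strata = Counter(sample for batch in batches for sample in batch)
    value = 0.0
    for batch in batches:
        observed = Counter(batch)
        for stratum, size in strata.items():
            value += (observed[stratum] - len(batch) * size / total) ** 2
    return value


def blocked_setup(samples, n_batches, rng):
    """One random setup that blocks on the first variable only."""
    groups = {}
    for sample in samples:
        groups.setdefault(sample[0], []).append(sample)
    batches = [[] for _ in range(n_batches)]
    slot = 0
    for level in sorted(groups):
        members = groups[level]
        rng.shuffle(members)
        for sample in members:
            # Deal round-robin so each level is spread evenly over batches.
            batches[slot % n_batches].append(sample)
            slot += 1
    return batches


def optimal_block_sketch(samples, n_batches, n_setups=1000, seed=None):
    """Generate a pool of blocked setups and keep the most balanced one."""
    rng = random.Random(seed)
    pool = [blocked_setup(samples, n_batches, rng) for _ in range(n_setups)]
    return min(pool, key=strata_objective)


samples = ([("case", "A")] * 4 + [("case", "B")] * 4 +
           [("control", "A")] * 4 + [("control", "B")] * 4)
best = optimal_block_sketch(samples, n_batches=2, n_setups=200, seed=0)
print([Counter(s[0] for s in batch) for batch in best])
```

Every setup in the pool satisfies the blocking on the first variable by construction; the selection step only decides which of them spreads the remaining variable most evenly.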

Implementation 

We provide a brief overview of OSAT usage
below. A more detailed description of package functionality
can be found in the package vignette and manual.

Data format 

To begin, sample variables to be considered in the
sample-to-batch assignment are encapsulated in an
object using the function

sample <- setup.sample(x, optimal, ...)

where in data frame x each sample is represented by a
row, and categorical variables, including the variable of primary
interest and other variables, are listed as columns. The
parameter optimal indicates the vector of variables to be
considered.

Batch layout 

Next, the number of plates to be used in the genomic
experiment, the layout design of these plates, and the
level of batch effect to be considered are captured in a
container object using the constructor function

container <- setup.container(plate, n, batch, ...)

where parameter plate is an object representing the layout
(number and type of chips used, rows and columns
of wells, their ordering, etc.) of the plate used
in the experiment. Layouts of some commonly used
plates and chips are predefined in our package (e.g., the
IlluminaBeadChip plate). Users can define their own
layout using the classes and methods provided in OSAT.
The optional parameter batch has the default value "plates",
indicating that the batch effect will be considered at the plate level.
Users can set batch="chips" to consider the batch effect at
the chip level.

Block randomization and optimization 

Third, the sample-to-batch assignment can be created
through the function

create.optimized.setup(fun="optimal.shuffle", sample, container, ...)

The default algorithm is implemented in function
optimal.shuffle, while the alternative algorithm is implemented
in function optimal.block. Users can also define their own
objective function following the instructions in the package
vignette.

Output 

Last, bar plots of sample counts by batch for all variables
considered are provided for visual inspection of the
sample assignment. Chi-square tests are also performed to examine
the dependence of sample variables on batches. The final
sample-to-batch assignment can be output to CSV.

Availability and requirements 

Project name: OSAT 

Project home page: http://bioconductor.org/packages/ 
2.11/bioc/html/OSAT.html 

Operating system(s): Windows, Unix-like (Linux, Mac OS X)

Programming language: R >= 2.15 
License: Artistic-2.0 
Any restrictions to use by non-academics: None 

Additional file 

Additional file 1: Table S1. Example data. Table S2. Data distribution.
Figure S1. Number of samples per plate. Paired specimens are placed on
the same chip. Sample assignment uses the optimal.block method.